In [ ]:
 

Filling/Imputing NUll values for Numerical data: Pima dataset Analysis¶

In data science,sometimes in real world datasets, there can be values missing due to many reasons(). An easy solution for it is just remove those datapoint from the dataset. But if the missing values are significant , then we have to fill those missing values before applying any machine learning model .

Ignoring missing values can have a negative effect on the accuracy of the model. So its good practice to fill/impute those missing values as much as possible.

Missing values can be both categorical or numerical.They are different methods to handle missing values .For our example, we focus on handling numerical type values.

Motivation/Problem Description:¶

To demonstrate filling up missing values in numerical data, we will look at an example using the pima.csv dataset. The dataset contains data about the Pima people which predicts their posibility of having diabetes based on the features Pregnancies, Glucose, BloodPressure, SkinThickness, Insulin, BMI, DiabetesPedigreeFunction and Age.

The pima dataset has lots of missing values. In this example, we will find the missing values in the dataset and fill the missing values based on their distribution.

Load the dataset¶

Our first task is to load the pandas library and the pima.csv dataset into a dataframe. The steps are:

  • import the pandas library
  • load the data into a variable pima using read_csv method in pandas
  • view the variable pima.
In [1]:
import pandas as pd

pima= pd.read_csv("pima_updated_28Jan.csv")

pima.head()
Out[1]:
Pregnancies Glucose BloodPressure SkinThickness Insulin BMI DiabetesPedigreeFunction Age Outcome
0 6 148.0 72.0 35.0 NaN 33.6 0.627 50 1
1 1 85.0 66.0 29.0 NaN 26.6 0.351 31 0
2 8 183.0 64.0 NaN NaN 23.3 0.672 32 1
3 1 89.0 66.0 23.0 94.0 28.1 0.167 21 0
4 0 137.0 40.0 35.0 168.0 43.1 2.288 33 1

Check for Null values in dataset:¶

Our next task is to check where the null values exists in the dataset. We check all the columns to find the null values in the dataset. For this task we use the isnull() method.

The steps are:

  • use the isnull() method to get all the null values in pima variable.
  • use the sum() method to get the null value count in each column.
In [3]:
pima.isnull().sum()
Out[3]:
Pregnancies                   0
Glucose                       5
BloodPressure                35
SkinThickness               227
Insulin                     374
BMI                          11
DiabetesPedigreeFunction      0
Age                           0
Outcome                       0
dtype: int64

From the output shown, If we look closely at the null values in each column, columns SkinThickness and Insulin have a lot of null values.

For this reason , in this task, we focus on filling the missing/NULL values in the Insulin and SkinThickness columns.

Observe Column data distribution:¶

To fill missing values in numerical data, aggregation methods such as mean , median are used in general cases.

For Skin_Thickness, to fill the missing values, our first task is to see the distribution of its data.

For this, we can use a histogram to observe its distribution. We will use the plotly library to use the histogram.

The steps are:

  • import the plotly library
  • from plotly, call the histogram() method in variable fig to draw the histogram
  • inside the method, the parameters are,
    • variable pima, where the data is stored
    • the column SkinThickness, whose histogram will be shown
    • the number of bins in the histogram, nbins
  • display the histogram using the show method
In [3]:
import plotly.express as px

fig= px.histogram(pima, x='SkinThickness', nbins=50)

fig.show()

From the histogram above, we can see the data distribution of data is a normal distribution. In this type of scenarios, to impute the missing values in data, we usually use the mean of the data(Skin_Thickness) column since the values are centered not skewed.

Similarly, for Insulin value, we look at the distribution of its data.

The steps are:

  • call the histogram method in variable fig2 to draw the histogram
  • inside the method, the parameters are,
    • variable pima, where the data is stored
    • the column Insulin, whose histogram will be shown
    • the number of bins in the histogram, nbins
  • display the histogram using the show method
In [4]:
fig2= px.histogram(pima, x='Insulin', nbins=50) 

fig2.show()

From the histogram, we can see that the distibution of Insulin data is right skewed meaning most of the datapoints are on the left side. In this type of scenario, its better to fill the missing values using the median of the data(Insulin) since most of the value will also be on the left side of the Insulin data.

Fill the column data based on distribution:¶

Based on the distribution of data, we fill their missing values with mean and median of their existing values.

Fill SkinThickness missing values:¶

The steps are:

  • Select the SkinThickness column from pima.
  • use the fillna method to fill the NULL values
  • the method parameters will be
    • the SkinThickness column data, using the mean method
    • keyword inplace and set value as True to make the changes permanent
  • view the new SkinThickness values using the head method
In [6]:
pima['SkinThickness'].fillna(pima['SkinThickness'].mean(), inplace=True)

pima.head(10)
Out[6]:
Pregnancies Glucose BloodPressure SkinThickness Insulin BMI DiabetesPedigreeFunction Age Outcome
0 6 148.0 72.0 35.00000 NaN 33.6 0.627 50 1
1 1 85.0 66.0 29.00000 NaN 26.6 0.351 31 0
2 8 183.0 64.0 29.15342 NaN 23.3 0.672 32 1
3 1 89.0 66.0 23.00000 94.0 28.1 0.167 21 0
4 0 137.0 40.0 35.00000 168.0 43.1 2.288 33 1
5 5 116.0 74.0 29.15342 NaN 25.6 0.201 30 0
6 3 78.0 50.0 32.00000 88.0 31.0 0.248 26 1
7 10 115.0 NaN 29.15342 NaN 35.3 0.134 29 0
8 2 197.0 70.0 45.00000 543.0 30.5 0.158 53 1
9 8 125.0 96.0 29.15342 NaN NaN 0.232 54 1
10 4 110.0 92.0 29.15342 NaN 37.6 0.191 30 0
11 10 168.0 74.0 29.15342 NaN 38.0 0.537 34 1
12 10 139.0 80.0 29.15342 NaN 27.1 1.441 57 0
13 1 189.0 60.0 23.00000 846.0 30.1 0.398 59 1
14 5 166.0 72.0 19.00000 175.0 25.8 0.587 51 1
15 7 100.0 NaN 29.15342 NaN 30.0 0.484 32 1
16 0 118.0 84.0 47.00000 230.0 45.8 0.551 31 1
17 7 107.0 74.0 29.15342 NaN 29.6 0.254 31 1
18 1 103.0 30.0 38.00000 83.0 43.3 0.183 33 0
19 1 115.0 70.0 30.00000 96.0 34.6 0.529 32 1

Fill Insulin missing values:¶

The steps are:

  • Select the Insulin column from pima.
  • use the fillna method to fill the NULL values
  • the method parameters will be
    • the Insulin column data, using the median method
    • keyword inplace and set value as True to make the changes permanent
  • view the new Insulin values using the head method
In [7]:
pima['Insulin'].fillna(pima['Insulin'].median(), inplace=True)

pima.head(10)
Out[7]:
Pregnancies Glucose BloodPressure SkinThickness Insulin BMI DiabetesPedigreeFunction Age Outcome
0 6 148.0 72.0 35.00000 125.0 33.6 0.627 50 1
1 1 85.0 66.0 29.00000 125.0 26.6 0.351 31 0
2 8 183.0 64.0 29.15342 125.0 23.3 0.672 32 1
3 1 89.0 66.0 23.00000 94.0 28.1 0.167 21 0
4 0 137.0 40.0 35.00000 168.0 43.1 2.288 33 1
5 5 116.0 74.0 29.15342 125.0 25.6 0.201 30 0
6 3 78.0 50.0 32.00000 88.0 31.0 0.248 26 1
7 10 115.0 NaN 29.15342 125.0 35.3 0.134 29 0
8 2 197.0 70.0 45.00000 543.0 30.5 0.158 53 1
9 8 125.0 96.0 29.15342 125.0 NaN 0.232 54 1
10 4 110.0 92.0 29.15342 125.0 37.6 0.191 30 0
11 10 168.0 74.0 29.15342 125.0 38.0 0.537 34 1
12 10 139.0 80.0 29.15342 125.0 27.1 1.441 57 0
13 1 189.0 60.0 23.00000 846.0 30.1 0.398 59 1
14 5 166.0 72.0 19.00000 175.0 25.8 0.587 51 1
15 7 100.0 NaN 29.15342 125.0 30.0 0.484 32 1
16 0 118.0 84.0 47.00000 230.0 45.8 0.551 31 1
17 7 107.0 74.0 29.15342 125.0 29.6 0.254 31 1
18 1 103.0 30.0 38.00000 83.0 43.3 0.183 33 0
19 1 115.0 70.0 30.00000 96.0 34.6 0.529 32 1

Check if any missing values exists after imputing:¶

After filling the missing values, lets check if there are any missing values in the SkinThickness and Insulin columns.

The steps are:

  • select the SkinThickness,Insulin column data from the pima variable.
  • use the isnull() method to get the NULL values in those columns.
  • use the sum() method to sum the null value counts in those columns.
In [12]:
pima[['SkinThickness','Insulin']].isnull().sum()
Out[12]:
SkinThickness    0
Insulin          0
dtype: int64

Conclusion¶

We have filled out missing values of numerical type data based on their distribution.

In [ ]: